A visual context-aware multimodal system for spoken language processing
Authors

Abstract
Recent psycholinguistic experiments show that acoustic and syntactic aspects of online speech processing are influenced by visual context through cross-modal effects: during interpretation of speech, visual context appears to steer speech processing and vice versa. Motivated by these findings, we present a real-time multimodal system that performs early integration of visual contextual information to recognize the most likely word sequences in spoken language utterances. The system first acquires a grammar and a visually grounded lexicon through a "show-and-tell" procedure in which the training input consists of camera images of scenes containing sets of objects, paired with verbal object descriptions. Given a new scene, the system generates a dynamic visually grounded language model and drives a dynamic model of visual attention to steer speech recognition search paths towards more likely word sequences.
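The core idea of a dynamic visually grounded language model can be sketched as interpolating a base n-gram probability with a visual prior that favors words grounded in objects currently visible in the scene. The sketch below is illustrative only: the function names, the `alpha` interpolation weight, and the `floor` probability for ungrounded words are assumptions, not details from the paper.

```python
def visual_prior(vocab, grounded_lexicon, scene_objects, floor=0.1):
    """Distribution over the vocabulary biased toward visible objects.

    grounded_lexicon maps each word to the set of object classes it
    describes (learned from the show-and-tell procedure); scene_objects
    is the set of object classes detected in the current scene.
    """
    weights = {
        w: 1.0 if grounded_lexicon.get(w, set()) & scene_objects else floor
        for w in vocab
    }
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def grounded_bigram_prob(word, prev, bigram_probs, prior, alpha=0.4):
    """Interpolate a base bigram probability with the visual prior.

    alpha controls how strongly the visible scene steers recognition;
    its value here is purely illustrative.
    """
    base = bigram_probs.get((prev, word), 1e-6)
    return (1 - alpha) * base + alpha * prior[word]
```

In a recognizer, scores like these would re-rank hypotheses during search, so that a scene containing a ball raises the probability of "ball" relative to equally likely but ungrounded words.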
Similar resources
The multimodal nature of spoken word processing in the visual world: Testing the predictions of alternative models of multimodal integration
Ambiguity in natural language is ubiquitous (Piantadosi, Tily & Gibson, 2012), yet spoken communication is effective due to integration of information carried in the speech signal with information available in the surrounding multimodal landscape. However, current cognitive models of spoken word recognition and comprehension are underspecified with respect to when and how multimodal information...
Spontaneous Speech Recognition Using Visual Context-Aware Language Models
The thesis presents a novel situationally-aware multimodal spoken language system called Fuse that performs speech understanding for visual object selection. An experimental task was created in which people were asked to refer, using speech alone, to objects arranged on a table top. During training, Fuse acquires a grammar and vocabulary from a “show-and-tell” procedure in which visual scenes a...
A comprehensive model of spoken word recognition must be multimodal: Evidence from studies of language mediated visual attention
When processing language, the cognitive system has access to information from a range of modalities (e.g. auditory, visual) to support language processing. Language mediated visual attention studies have shown sensitivity of the listener to phonological, visual, and semantic similarity when processing a word. In a computational model of language mediated visual attention, that models spoken wor...
Language as a multimodal phenomenon: implications for language learning, processing and evolution
Our understanding of the cognitive and neural underpinnings of language has traditionally been firmly based on spoken Indo-European languages and on language studied as speech or text. However, in face-to-face communication, language is multimodal: speech signals are invariably accompanied by visual information on the face and in manual gestures, and sign languages deploy multiple channels (han...
Making Relative Sense: From Word-Graphs To Semantic Frames
Scaling up from controlled single domain spoken dialogue systems towards conversational, multi-domain and multimodal dialogue systems poses new challenges for the reliable processing of less restricted user utterances. In this paper we explore the feasibility to employ a general purpose ontology for various tasks involved in processing the user’s utterances.
Journal:
Volume, Issue:
Pages: -
Published: 2003